Oğuzhan Ercan
x.com/oguzhannercan
To achieve better quality at low step counts, they propose distilling along the student's backward path instead of the forward path. Put differently, rather than having the student mimic the teacher, they use the teacher to improve the student based on its current state of knowledge. They also propose a Shifted Reconstruction Loss that dynamically adapts the knowledge transfer from the teacher model: the loss is designed to distill global, structural information from the teacher at high time steps, while focusing on rendering fine-grained details and high-frequency components at lower time steps. Finally, they propose noise correction, a training-free inference-time modification that enhances sample quality.
The noise schedule is commonly chosen such that x_T is not pure noise during training, but rather contains low-frequency information leaked from x_0. The interpolant is x_t = α_t x_0 + σ_t x_T; the leakage arises because any stochastic interpolant x_t with t < T still contains information from the ground-truth sample via the first summand α_t x_0.
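The leakage is easy to see numerically. Below is a toy sketch (my own illustration, not the paper's code) with a linear schedule α_t = 1 − t/T, σ_t = t/T: for any t < T the interpolant stays strongly correlated with the clean sample x_0 through the α_t x_0 term.

```python
import numpy as np

# Toy demo of information leakage in a stochastic interpolant
# x_t = alpha_t * x0 + sigma_t * xT. Schedule is an assumed linear one.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(1000)   # ground-truth sample
xT = rng.standard_normal(1000)   # terminal noise

def interpolate(t, T=1.0):
    alpha_t, sigma_t = 1.0 - t / T, t / T  # linear schedule, for illustration only
    return alpha_t * x0 + sigma_t * xT

for t in (0.25, 0.75, 1.0):
    xt = interpolate(t)
    print(f"t={t}: corr(x0, x_t) = {np.corrcoef(x0, xt)[0, 1]:.2f}")
```

The correlation with x_0 stays high for small t and only vanishes at t = T, which is exactly the ground-truth signal that backward distillation removes from the training input.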
Backward distillation eliminates information leakage at all time steps t, preventing the model from relying on a ground-truth signal. This is achieved by simulating the inference process during training, which can also be interpreted as calibrating the student on its own upstream backward path. They first perform backward iterations of the student model to obtain its own intermediate state at step t, then use this as input for both the student and teacher models during training.
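A minimal sketch of that training-input construction, assuming a linear schedule and a DDIM-style one-step update (the function names and the stand-in denoiser are my own, not the paper's implementation): instead of noising ground truth to get x_t, we run the student's own backward path from pure noise down to t.

```python
import numpy as np

rng = np.random.default_rng(1)

def student_denoise(x, t):
    # stand-in student x0-predictor: a crude shrinkage toward zero
    return 0.9 * x

def backward_to(x, ts, t_target):
    """Simulate the student's inference from x_T down to t_target (leak-free)."""
    for t, s in zip(ts, ts[1:]):
        x0_hat = student_denoise(x, t)
        x = x0_hat + (s / t) * (x - x0_hat)  # DDIM-style step under a linear schedule
        if s <= t_target:
            break
    return x

xT = rng.standard_normal(16)              # pure noise, no x0 involved
timesteps = [1.0, 0.75, 0.5, 0.25]
x_t = backward_to(xT, timesteps, t_target=0.5)
# x_t is now fed to BOTH the student and the teacher when computing the loss,
# calibrating the student on its own upstream backward path.
```

Because x_t is produced entirely from noise and the student's own predictions, no ground-truth signal can leak into the training input at any step.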
For the distillation loss, they define a shifted reconstruction loss designed so that for higher values of t, the target produced by the teacher model displays global content similarity with the student output but with improved semantic text alignment; for lower values of t, the target image features enhanced fine-grained details while maintaining the same overall structure as the student's prediction.
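One way to picture the time-dependent target (the shift function and stand-in teacher below are hypothetical; the paper defines its own mapping): re-noise the student's prediction to a shifted time t′ and let the teacher denoise it. A large t′ at high t lets the teacher reshape global structure; a small t′ at low t only refines details on top of the student's structure.

```python
import numpy as np

rng = np.random.default_rng(2)

def teacher_denoise(x, t):
    return 0.95 * x  # stand-in teacher x0-predictor

def shift(t):
    # hypothetical monotone shift: keeps high t high, compresses low t
    return t ** 2

def shifted_target(student_x0, t):
    t_s = shift(t)
    noise = rng.standard_normal(student_x0.shape)
    renoised = (1.0 - t_s) * student_x0 + t_s * noise  # re-noise student output
    return teacher_denoise(renoised, t_s)              # teacher cleans it up

student_x0 = rng.standard_normal(8)
target_high = shifted_target(student_x0, t=0.9)  # structure-oriented target
target_low = shifted_target(student_x0, t=0.1)   # detail-oriented target
loss = np.mean((student_x0 - target_low) ** 2)   # reconstruction loss vs. target
```

At low t the target stays close to the student's own prediction (detail refinement); at high t the heavy re-noising gives the teacher freedom to impose global content and text alignment.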
At t = T the input is pure noise, so predicting the noise at that time step is not informative. Existing works therefore propose predicting the velocity, i.e. the rate of change, but converting a model to velocity prediction requires extra training effort. They present a training-free alternative: by treating t = T as a special case and replacing ε_Θ with the true noise x_T, the update f is corrected.
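A toy sketch of that special case (my own stand-in sampler, not Emu's): at t = T the input x_T is the noise, so substituting it for the model's ε prediction fixes the first update step without any retraining, whereas the uncorrected ε-parameterized step degenerates.

```python
import numpy as np

def eps_model(x, t):
    return 0.8 * x  # stand-in (imperfect) epsilon predictor

def sample_step(x, t, s, noise_correction=True):
    """One DDIM-style step for the linear interpolant x_t = (1-t)*x0 + t*eps."""
    if t == 1.0 and noise_correction:
        eps = x                    # true noise: at t = T the input IS the noise
        x0_hat = np.zeros_like(x)  # x - t*eps vanishes identically, so x0_hat = 0
    else:
        eps = eps_model(x, t)
        with np.errstate(divide="ignore", invalid="ignore"):
            x0_hat = (x - t * eps) / (1.0 - t)  # blows up at t = 1 without correction
    return (1.0 - s) * x0_hat + s * eps

rng = np.random.default_rng(3)
xT = rng.standard_normal(8)
good = sample_step(xT, t=1.0, s=0.5, noise_correction=True)
bad = sample_step(xT, t=1.0, s=0.5, noise_correction=False)
```

With the correction the first step simply scales the noise down (here to s·x_T); without it, the ε-parameterized x0 estimate is undefined at t = T, which is why prior work resorts to retraining for velocity prediction.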
Imagine Flash: Accelerating Emu Diffusion Models with Backward
Distillation 18 Apr 2024
https://arxiv.org/pdf/2405.05224